Sunday, July 13, 2008

Assessing the correctness of scientific computation

In the previous post I promised some thoughts on assessing the correctness of scientific computation.


Computational science programs are typically assessed with a series of test cases, ranging from order-of-magnitude estimates ("those numbers don't look right") to comparisons with experimental data. I've attempted to divide these test cases into categories, where each category shares similar expectations as to what 'close enough' means and has similar implications for the potential causes of discrepancies.




  1. Order of magnitude estimates from general principles, back of the envelope calculations, or experience with similar problems.
  2. Analytically solvable cases.
  3. Small cases that are accurately solvable
    For small systems, or simple cases, there are often multiple solution methods, and some of them are very accurate (or exact). The methods can be compared for these small systems. An important feature of this category is that any approximations can be controlled and their effects made arbitrarily small.
  4. Results from the same (or similar) algorithm
    Comparing the same system with the same algorithm should yield close results, but now there is additional uncertainty from the exact implementation and from sensitivity to input precision (particularly when comparing with results from a journal article).
  5. Results from a different algorithm
    This looks similar to the earlier category of comparing with exact or more accurate methods, but this is the case where there are multiple methods (with different approximations) for handling larger systems. There are likely to be more approximations involved (and probably uncontrolled ones).

    (In electronic structure there are Hartree-Fock (HF), several post-HF methods, Density Functional Theory, and Quantum Monte Carlo, and possibly a few others.) Now one has to deal with all of the above possibilities for two programs, rather than just one.
  6. Experimental results
    In some ways this is quite similar to the previous category, except that determining 'sufficiently close' requires some understanding of the experimental uncertainties and corrections, in addition to possible program errors.



The testing process for each case involves running the test case and deciding whether the program result is 'close enough'. If it is, proceed to the next test case. If it is not, work to discover the cause of the discrepancy. In the next post I'll look at a list of possible causes.
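
To make the process concrete, here is a minimal Python sketch of the loop (the test cases, reference values, and tolerances are invented for illustration, and the 'close enough' check is just a stand-in absolute-difference comparison):

    # A minimal sketch of the assessment loop described above; the test
    # cases, reference values, and tolerances are invented for illustration.

    def close_enough(result, reference, tolerance):
        # Deciding what 'close enough' actually means is the hard part
        # (and the subject of a later post); a plain absolute difference
        # is used here only as a stand-in.
        return abs(result - reference) < tolerance

    # Each entry: name, value produced by the program, reference value
    # (here the analytic harmonic oscillator ground state E0 = omega/2),
    # and a tolerance.
    test_cases = [
        ("harmonic oscillator E0, omega=1", 0.4999998, 0.5, 1e-5),
        ("harmonic oscillator E0, omega=2", 1.0003,    1.0, 1e-5),
    ]

    for name, result, reference, tolerance in test_cases:
        if close_enough(result, reference, tolerance):
            print(f"{name}: close enough, proceed to the next case")
        else:
            print(f"{name}: discrepancy of {result - reference:+.2e}, investigate")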

Saturday, July 12, 2008

Engineering scientific software

Some pointers from Greg Wilson on engineering scientific software.


The first pointer is to the "First International Workshop on Software Engineering for Computational Science and Engineering".


A couple of Wilson's short summaries of papers from this workshop (from recent research reading):

Diane Kelly and Rebecca Sanders: “Assessing the Quality of Scientific Software” (SE-CSE workshop, 2008). A short, useful summary of scientists’ quality assurance practices. As many people have noted, most commercial tools aren’t helpful when (a) cosmetic rearrangement of the code changes the output, and (b) the whole reason you’re writing a program is that you can’t figure out what the answer ought to be.
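
Point (a) is easy to reproduce: floating-point addition is not associative, so a purely cosmetic change such as reordering a sum can change the output. A small Python illustration:

    # Floating-point addition is not associative, so reordering a sum
    # (a purely cosmetic change at the source level) changes the result.
    a = [0.1, 0.2, 0.3]
    print(sum(a))                  # 0.6000000000000001
    print(sum(a[::-1]))            # 0.6
    print(sum(a) == sum(a[::-1]))  # False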


Judith Segal: “Models of Scientific Software Development”. Segal has spent the last few years doing field studies of scientists building software, and based on those has developed a model of their processes that is distinct in several ways from both classical and agile models.


What I found interesting:

Kelly and Sanders suggest that the split between verification and validation is not useful, and propose that 'assessment' be used as an umbrella term. (I will return to this point in a later post.)



The second pointer is to the presentations at TACC Scientific Software Days.


One of the presentations there was by Greg Wilson himself, on "HPC Considered Harmful". There is one slide on testing floating-point code (slide 15); in particular, one bullet point says that testing against a tolerance is "superstition, not science". (He has raised this point elsewhere.)
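
For reference, the practice Wilson is questioning looks roughly like the following (a hypothetical sketch; the energies and tolerances are invented): a computed value is accepted if it agrees with a reference value to within a relative or absolute tolerance.

    import math

    # A typical tolerance test: accept the computed value if it agrees
    # with a reference to within a relative (or absolute) tolerance.
    # The energies and tolerances below are invented for illustration.
    computed = -76.026760
    reference = -76.026765

    print(math.isclose(computed, reference, rel_tol=1e-6))  # True: "passes"
    print(math.isclose(computed, reference, rel_tol=1e-9))  # False: "fails"
    # The uncomfortable question is what justifies choosing 1e-6 over 1e-9.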


Wilson's statement has bothered me since I first read it, so the next few posts will explore some of the context surrounding this question. The first post will list different methods for assessing the correctness of a program, the next will categorize possible errors, and a final post will look at what 'close enough' means.